Word Clustering and Disambiguation Based on Co-occurrence Data1

نویسنده

  • Hang Li
چکیده

We address the problem of clustering words (or constructing a thesaurus) based on cooccurrence data, and conducting syntactic disambiguation by using the acquired word classes. We view the clustering problem as that of estimating a class-based probability distribution specifying the joint probabilities of word pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability model. Our clustering method is a natural extension of that proposed in (Brown, Della Pietra, deSouza, Lai, and Mercer 1992). We next propose a syntactic disambiguation method which combines the use of automatically constructed word classes and that of a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 88.2%, which compares favorably against the accuracies obtained by the state-of-the-art disambiguation methods.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Graph-based Word Clustering using a Web Search Engine

Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure by web counts. Each pair of words is queried to a search engine, which produces a co-occurrence matrix. By cal...

متن کامل

Word Clustering and Disambiguat ion Based on Co-occurrence Data

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and using the acquired word classes to improve the accuracy of syntactic disambiguation. We view this problem as that of estimating a joint probability distribution specifying the joint probabilities of word pairs, such as noun verb pairs. We propose an efficient algorithm based on the Minimum ...

متن کامل

A Comparative Study of Word Co-occurrence for Term Clustering in Language Model-based Sentence Retrieval

Sentence retrieval is a very important part of question answering systems. Term clustering, in turn, is an effective approach for improving sentence retrieval performance: the more similar the terms in each cluster, the better the performance of the retrieval system. A key step in obtaining appropriate word clusters is accurate estimation of pairwise word similarities, based on their tendency t...

متن کامل

Word Sense Disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering

With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the GeneOntology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as 'development' can refer to developmental biology or to ...

متن کامل

Co-Occurrence Cluster Features for Lexical Substitutions in Context

This paper examines the influence of features based on clusters of co-occurrences for supervised Word Sense Disambiguation and Lexical Substitution. Cooccurrence cluster features are derived from clustering the local neighborhood of a target word in a co-occurrence graph based on a corpus in a completely unsupervised fashion. Clusters can be assigned in context and are used as features in a sup...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002